Development and Performance Analysis of a Fault Tolerant Algorithm for Cluster of Workstations

Authors

  • Syed Misbahuddin
  • Nizar Al-Holou
Abstract

A Cluster of Workstations (COW) is a network-based multi-computer system and the most prominent distributed-memory system aimed at replacing supercomputers. A cluster of workstations can be viewed as a single machine in which one job is divided into n subtasks and delegated to n workstations in the COW architecture. For the job to complete, every subtask assigned to the component workstations must complete, so all workstations must be functional. A faulty node can therefore suspend the overall job, which cannot be completed until the faulty node recovers from its fault. This paper presents a fault-tolerant architecture for COW that allows a normally working workstation to perform the tasks of a faulty workstation in addition to its original assignments. Markov models are basic tools for availability modeling. This paper also presents a Markov availability model for estimating the availability of component workstations as a function of workstation failure rates.
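As a rough illustration of availability as a function of failure rate, the sketch below uses the textbook two-state (up/down) Markov model, whose steady-state availability is mu / (lambda + mu). This is not the paper's model, which the abstract does not specify. The repair rate, the independence of workstations, and both function names are assumptions introduced here for illustration.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of one workstation.

    Assumes a two-state continuous-time Markov chain (up <-> down) with
    constant failure rate (lambda) and constant repair rate (mu):
        A = mu / (lambda + mu)
    Both rates must use the same time unit (for example, per hour).
    """
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate must be > 0")
    return repair_rate / (failure_rate + repair_rate)


def all_nodes_up_availability(n: int, failure_rate: float, repair_rate: float) -> float:
    """Probability that all n workstations are up at the same time.

    Assumes identical workstations that fail and recover independently.
    This corresponds to the case without fault tolerance, where the job
    needs every workstation to be functional.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return steady_state_availability(failure_rate, repair_rate) ** n


if __name__ == "__main__":
    # Hypothetical rates chosen only to show the trend.
    for lam in (0.001, 0.01, 0.1):
        single = steady_state_availability(lam, repair_rate=1.0)
        cluster = all_nodes_up_availability(8, lam, repair_rate=1.0)
        print(f"lambda={lam}: single={single:.4f}, all 8 up={cluster:.4f}")
```

Under these assumptions, the availability of the cluster as a whole falls much faster than that of a single workstation as the failure rate grows, which is the situation a fault-tolerant architecture is meant to mitigate.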

Similar Resources

Scheduling Large Task Graphs in Parallel Using a Fault-Tolerant Heterogeneous-Cluster-Based Search

A natural approach for scheduling tasks to a workstation cluster is to employ the multiple machines in the cluster to schedule the task graphs so that the cluster manifests itself as a “self-scheduled” platform. A few parallel approaches have been devised for scheduling task graphs using a parallel machine such as an Intel Paragon, but they are not suitable for a cluster of workstations environ...

CAFT: Cost-aware and Fault-tolerant routing algorithm in 2D mesh Network-on-Chip

The increasing complexity of chips and the need to integrate more components into a chip have made network-on-chip an important infrastructure for on-chip communication and a good alternative to traditional bus-based approaches. As chip density increases, the probability of failure in the on-chip network increases, and providing correction and fault tol...

Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing

Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, since fault tolerance is incorporated into the matrix operations, the matrix operations become resilient to any single processor failure or change with low overhead. In this paper, we present a technique called multiple chec...

Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless chec...

Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems

Some applications are critical and must be designed as fault-tolerant systems. A voting algorithm is usually one of the principal elements of a fault-tolerant system. Two kinds of voting algorithm are used in most applications: the majority voting algorithm and the weighted average algorithm. These algorithms have some problems. Majority voting is confronted with the problem of threshold limits and voter of weight...

Publication date: 2004